Mechanism to invalidate data translation buffer entries in a multiprocessor system

ABSTRACT

According to one embodiment, a computer system is disclosed. The computer system includes a first central processing unit (CPU) having a translation buffer (TB) to store virtual to physical address translations, and a snoop filter coupled to the first CPU to mirror the operation of the TB and implemented to search for entries upon receiving an invalidation request from a second CPU.

FIELD OF THE INVENTION

The present invention relates to computer systems; more particularly, the present invention relates to computer systems having multiple processors.

BACKGROUND

Computer systems have long used virtual memory to allow multiple processes to share a single processor. Typically, the operating system (OS) associates an address space with each process. Each address space is divided into one or more fixed-size virtual pages. The OS maps these virtual pages to physical pages and keeps the corresponding translations in a software structure called the Page Table. Because the Page Table can be quite large, processors usually cache these translations in a hardware structure called a Translation Buffer (TB).

More specifically, a TB that caches translations for a data segment of a process is referred to as a Data Translation Buffer (DTB). User-level loads and stores access the DTB to obtain the corresponding physical address before accessing memory. A load or store suffers a DTB miss when it accesses the DTB but cannot find a corresponding translation. In such a case, either software or a hardware page table walker brings the corresponding translation into the DTB. In the process, it may also evict an existing entry from the DTB. The pipeline is restarted and typically the load or store is retried once the translation is brought into the DTB.
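The lookup, miss, fill, and eviction sequence described above can be sketched as a small software model. This is an illustration only, not the disclosed hardware: the class name `DTB`, the 4096-byte page size, the dictionary used as the page table, and the first-in first-out eviction are all assumptions made for the example.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size; the text does not specify one


class DTB:
    """Toy model of a data translation buffer (hypothetical, for illustration)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        # Assumed page table: virtual page number -> physical page number.
        self.page_table = page_table
        # Cached translations, kept in insertion order for FIFO eviction.
        self.entries = OrderedDict()
        self.misses = 0

    def translate(self, vaddr):
        """Return the physical address for vaddr, filling the DTB on a miss."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.entries:
            # DTB miss: a page table walk brings the translation in.
            self.misses += 1
            self._fill(vpn)
        # The access is retried and now finds the translation.
        return self.entries[vpn] * PAGE_SIZE + offset

    def _fill(self, vpn):
        if len(self.entries) >= self.capacity:
            # Evict the oldest entry to make room for the new one.
            self.entries.popitem(last=False)
        self.entries[vpn] = self.page_table[vpn]
```

In this model a second access to the same page does not increase `misses`, while touching more pages than `capacity` evicts the oldest translation.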

Whenever the OS changes a page table entry, it also invalidates the corresponding entry in the DTB. The OS changes a page table entry either when it changes the virtual to physical mapping (possibly due to a page swap to disk) or when it changes the protection level for a page. For a uniprocessor system, this is fairly easy and does not take too much of a processor's bandwidth.

However, a DTB invalidate operation in a shared-memory multiprocessor system can take tens of thousands of cycles. This is because whenever a processor changes a page table entry corresponding to a shared virtual page, corresponding entries in all DTBs in all of the other processors must be invalidated.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the invention. The drawings, however, should not be taken to limit the invention to the specific embodiments, but are for explanation and understanding only.

FIG. 1 illustrates one embodiment of a computer system;

FIG. 2 illustrates one embodiment of a CPU; and

FIG. 3 illustrates a flow diagram for one embodiment of a mechanism to invalidate data translation buffers.

DETAILED DESCRIPTION

An invalidation mechanism is described. Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

In the following description, numerous details are set forth. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

FIG. 1 is a block diagram of one embodiment of a computer system 100. Computer system 100 includes central processing units (CPUs) 102 coupled to bus 105. In one embodiment, CPUs 102 are processors in the Pentium® family of processors including the Pentium® II processor family, Pentium® III processors, and Pentium® IV processors available from Intel Corporation of Santa Clara, Calif. Alternatively, other CPUs may be used.

According to one embodiment, bus 105 includes a high-bandwidth memory bus component and an interrupt controller communications (ICC) component. Shared memory 115 is coupled to bus 105.

Shared memory 115 stores data and sequences of instructions and code represented by data signals that may be executed by the multiple CPUs 102 or any other device included in system 100. In one embodiment, shared memory 115 includes dynamic random access memory (DRAM); however, shared memory 115 may be implemented using other memory types.

In a further embodiment, one or more input/output (I/O) interfaces 119 are coupled to bus 105. An I/O interface 119 provides an interface to devices within computer system 100. For instance, I/O interface 119 may be coupled to a Peripheral Component Interconnect (PCI) bus adhering to Specification Revision 2.1 developed by the PCI Special Interest Group of Portland, Oreg.

As discussed above, an issue exists for invalidating DTBs in a shared-memory multiprocessor system (e.g., invalidation may take tens of thousands of cycles since corresponding entries in the DTBs of other processors must be invalidated whenever one processor changes a page table entry corresponding to a shared virtual page).

In current processors, there typically is no hardware mechanism to invalidate DTB entries from the outside of a processor, unlike the manner in which cache blocks in a processor's cache may be invalidated. Consequently, processors invoke a heavyweight inter-processor interrupt on a remote processor having DTB entries that are to be invalidated. The corresponding interrupt handler performs the invalidation.

Such an inter-processor interrupt to invalidate DTB entries is raised on every processor in a shared-memory multiprocessor system since the initiating processor has no knowledge about which processors have cached a copy of a page table entry in their respective DTBs. In some instances, it may be possible to reduce the number of interrupts by keeping the identity of the sharers in the page table. However, the initiating processor must at least interrupt every processor caching a copy of the DTB entry to be invalidated.
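The difference between interrupting every processor and interrupting only known sharers can be illustrated with a short sketch. The function name `shootdown_targets`, the integer processor identifiers, and the optional `sharers` set are hypothetical and exist only for this example.

```python
def shootdown_targets(initiator, num_cpus, sharers=None):
    """Return the processors that must receive an inter-processor interrupt.

    initiator: id of the processor that changed the page table entry.
    num_cpus:  total number of processors in the system.
    sharers:   optional set of processor ids known to cache the entry
               (assumed to be tracked in the page table); None means
               the sharers are unknown.
    """
    if sharers is None:
        # No sharer information: every other processor is interrupted.
        return [cpu for cpu in range(num_cpus) if cpu != initiator]
    # Sharer information available: only the caching processors are interrupted.
    return sorted(cpu for cpu in sharers if cpu != initiator)
```

With 16 processors and no sharer information, 15 interrupts are raised. If only processors 2 and 5 are recorded as sharers, two interrupts suffice.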

Past work has measured the performance of such DTB invalidations (more commonly known as DTB shootdowns). For example, for a 16-processor Encore Multimax, a DTB shootdown time of 1.6 milliseconds has been measured, an amount of time in which tens of millions of instructions may be executed on a single processor.

Thus, a DTB shootdown is a very expensive operation in current multiprocessor systems. As shared-memory multiprocessors become more pervasive, integrated circuit multiprocessors become more common, and larger numbers of processors are integrated in a single system, the DTB shootdown operation will become a performance limiter for certain large applications and operating systems.

One way to reduce the cost of the DTB shootdown is the implementation of a hardware solution. For instance, when a processor needs to invalidate DTB entries on other processors, the processor issues a DTB invalidation request (very similar to a cache block invalidation request) to the other processors. However, such a mechanism alone does not solve the problem, for two reasons.

First, the DTB is typically searched (or CAM-ed) using virtual addresses. The physical address that comes with the DTB invalidation request is not something that a standard DTB can CAM against. It may be possible to add a second CAM operation on the DTB for the physical address. However, that may increase the latency of a regular DTB access and thereby stretch the pipeline by one or more cycles. Alternatively, the entire DTB can be invalidated, which is not a very appealing solution because valid DTB entries will be unnecessarily invalidated.

Second, to allow external invalidates to snoop the DTB, a second port, or multiplexing of the single read port between DTB read and invalidate requests, would be needed. However, both solutions are undesirable. Adding a second port may increase the size of the DTB, thereby forcing a longer access time (for the CAM). The multiplexing option would slow DTB accesses from the processor.

According to one embodiment, a hardware structure is coupled to each CPU 102 in computer system 100. FIG. 2 illustrates one embodiment of a CPU 102. CPU 102 includes a DTB 210. DTB 210 is a hardware structure that caches virtual to physical page translations. In addition, a cache 220 is coupled to CPU 102. Further, DTB snoop filter 230 is coupled to CPU 102.

In one embodiment, DTB snoop filter 230 is a hardware structure that mirrors DTB 210. Accordingly, DTB snoop filter 230 is loaded with an entry each time DTB 210 is loaded on a miss. In a further embodiment, DTB snoop filter 230 acknowledges DTB invalidation requests so that an initiating CPU can make progress.

However, in one embodiment, DTB snoop filter 230 includes only physical addresses. Thus, unlike DTB 210, DTB snoop filter 230 does not include any other payload. In addition, DTB snoop filter 230 is searched against a physical address that is to be invalidated.
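A minimal sketch of why a structure holding only physical addresses suits an incoming invalidation request follows. The DTB contents, the helper names, and the use of a Python dictionary and set in place of CAM hardware are assumptions for illustration.

```python
# Hypothetical DTB contents: virtual page number -> (physical page number, writable).
dtb = {
    0x10: (0x2A, True),
    0x11: (0x2B, False),
    0x12: (0x2C, True),
}


def dtb_search_by_physical(dtb, ppn):
    """The DTB is keyed by virtual page, so a physical search must scan every entry."""
    return [vpn for vpn, (entry_ppn, _) in dtb.items() if entry_ppn == ppn]


def build_snoop_filter(dtb):
    """Mirror the DTB, keeping only the physical page numbers and no other payload."""
    return {ppn for ppn, _ in dtb.values()}


def snoop_hit(snoop_filter, ppn):
    """Search the snoop filter directly against the physical page to be invalidated."""
    return ppn in snoop_filter


snoop_filter = build_snoop_filter(dtb)
```

Here `snoop_hit(snoop_filter, 0x2B)` reports a match without touching the DTB, while a request for a physical page that is not cached is filtered out immediately.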

According to one embodiment, if both DTB 210 and DTB snoop filter 230 have a FIFO replacement policy, entries will be evicted correctly from both structures. However, if DTB 210 and DTB snoop filter 230 each have a random replacement policy, there is no direct guarantee that the same entries are replaced, and thus no guarantee that DTB 210 and DTB snoop filter 230 hold exactly the same entries. Thus, in such an embodiment, a solution is to replace the same exact entry in DTB snoop filter 230 as in DTB 210.
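Replacing the same exact entry in both structures can be sketched as follows. The class `MirroredDTB`, the slot-indexed lists, and the seeded random number generator are hypothetical choices for this example; the point shown is only that a single victim slot is chosen once and applied to both structures.

```python
import random


class MirroredDTB:
    """Toy model of a DTB and its snoop filter sharing one replacement decision."""

    def __init__(self, num_slots, seed=0):
        self.dtb = [None] * num_slots    # slot -> (virtual page, physical page)
        self.snoop = [None] * num_slots  # slot -> physical page only
        self.rng = random.Random(seed)   # seeded only to make the sketch repeatable

    def fill(self, vpn, ppn):
        """Load a translation on a miss, writing the same slot in both structures."""
        free = [i for i, entry in enumerate(self.dtb) if entry is None]
        # Use a free slot if one exists, otherwise pick one random victim.
        slot = free[0] if free else self.rng.randrange(len(self.dtb))
        self.dtb[slot] = (vpn, ppn)
        self.snoop[slot] = ppn
        return slot

    def consistent(self):
        """True when every slot of the snoop filter mirrors the DTB slot."""
        return all(
            (d is None and s is None) or (d is not None and d[1] == s)
            for d, s in zip(self.dtb, self.snoop)
        )
```

If each structure drew its own random victim instead, the two could diverge after the first eviction, which is the situation the paragraph above warns against.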

According to one embodiment, every external DTB invalidate operation will be searched at DTB snoop filter 230. A match will indicate that DTB 210 has a corresponding entry that must be invalidated. Subsequently, CPU 102 will flush all non-committed instructions, find and invalidate the corresponding entries from DTB 210 and DTB snoop filter 230, and restart.

FIG. 3 is a flow diagram illustrating one embodiment of the operation at a CPU 102 and corresponding DTB snoop filter 230 upon receiving an invalidate operation. At processing block 310, an invalidate operation from another CPU (e.g., CPU 102(2)) is received at a CPU (e.g., CPU 102(1)). As discussed above, the invalidate operation may be the result of a corresponding page table entry being changed at CPU 102(2).

At processing block 320, DTB snoop filter 230 is searched for the entry to be invalidated. In one embodiment, DTB snoop filter 230 is searched via a CAM operation. At processing block 330, it is determined whether the entry is stored within DTB snoop filter 230. If the entry is not located within DTB snoop filter 230, no action is taken and control is returned to processing block 310 where another operation may be received.

If, however, the table entry is found within DTB snoop filter 230, all non-committed instructions are flushed from CPU 102 at processing block 340. According to one embodiment, DTB snoop filter 230 has an index into DTB 210. Thus, if the table entry is found in DTB snoop filter 230, there is no need to search DTB 210. Instead, DTB snoop filter 230 simply picks up the entry.

At processing block 350, the corresponding table entry is invalidated at DTB 210 and DTB snoop filter 230. According to one embodiment, DTB snoop filter 230 transmits an interrupt to CPU 102. In response, CPU 102 halts operation while the entry is removed from DTB 210. In another embodiment, DTB snoop filter 230 directly invalidates DTB 210. In such an embodiment, DTB snoop filter 230 uses a standard write port to directly access DTB 210. Thus, there is no need for CPU 102 to stop.
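The flow of processing blocks 310 through 350 can be sketched in software as shown here. This is a simplified model under stated assumptions: the class `Cpu`, its slot-indexed lists, and the `flushes` counter are hypothetical, and the sketch does not distinguish the interrupt-based embodiment from the direct write port embodiment.

```python
class Cpu:
    """Toy model of a CPU with a DTB and a mirroring snoop filter."""

    def __init__(self, num_slots):
        self.dtb = [None] * num_slots           # slot -> (virtual page, physical page)
        self.snoop_filter = [None] * num_slots  # slot -> physical page only
        self.flushes = 0                        # pipeline flushes performed

    def load(self, slot, vpn, ppn):
        """Load a translation into the same slot of both structures."""
        self.dtb[slot] = (vpn, ppn)
        self.snoop_filter[slot] = ppn

    def handle_invalidate(self, ppn):
        """Process an invalidate operation received from another CPU (block 310)."""
        # Blocks 320 and 330: search the snoop filter by physical page.
        hits = [i for i, p in enumerate(self.snoop_filter) if p == ppn]
        if not hits:
            # Filtered out: no flush, and the DTB is never consulted.
            return False
        # Block 340: flush all non-committed instructions.
        self.flushes += 1
        # Block 350: the snoop filter slot indexes the DTB, so no DTB search is needed.
        for slot in hits:
            self.dtb[slot] = None
            self.snoop_filter[slot] = None
        return True
```

A request for a physical page that is not cached returns `False` and leaves `flushes` unchanged, which corresponds to the branch that returns control to processing block 310.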

The above-described mechanism features a hardware CAM structure that an incoming DTB invalidation request snoops against. Thus, unnecessary shootdowns are filtered out and only shootdowns that will invalidate a true DTB entry in the processor are scheduled.

Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that any particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, references to details of various embodiments are not intended to limit the scope of the claims which in themselves recite only those features regarded as the invention.

1. A computer system comprising: a first central processing unit (CPU) having a translation buffer (TB) to store virtual to physical address translations; and a snoop filter, coupled to the first CPU, to mirror the operation of the first TB and implemented to search for entries upon receiving an invalidation request from a second CPU.
2. The computer system of claim 1 wherein a match found at the snoop filter during a search for entries indicates that an entry is to be invalidated at the snoop filter and the TB.
3. The computer system of claim 2 wherein non-committed instructions at the first CPU are flushed prior to the entry being invalidated at the snoop filter and the TB.
4. The computer system of claim 1 wherein the snoop filter acknowledges invalidation requests received from the second CPU.
5. The computer system of claim 1 wherein the snoop filter is loaded with an entry each time the TB is loaded on a miss.
6. The computer system of claim 5 wherein the snoop filter and the TB implement a first in first out (FIFO) replacement policy to evict entries.
7. The computer system of claim 5 wherein the snoop filter and the TB implement a random replacement policy to evict entries.
8. The computer system of claim 7 wherein the same entries within snoop filter and the TB are replaced.
9. The computer system of claim 1 wherein the snoop filter comprises only physical addresses.
10. A method comprising: receiving an invalidation request at a first central processing unit (CPU) from a second CPU to invalidate an entry within a translation buffer (TB) at the first CPU; searching a snoop filter coupled to the first CPU to find the entry; and invalidating the entry at the TB and the snoop filter if the entry is found within the snoop filter.
11. The method of claim 10 further comprising flushing non-committed instructions at the first CPU prior to the entry being invalidated at the snoop filter and the TB.
12. The method of claim 10 wherein invalidating the entry at the TB comprises: transmitting an interrupt from the snoop filter to the first CPU; and halting the operation of the first CPU; and removing the entry from the TB.
13. The method of claim 10 wherein invalidating the entry at the TB comprises the snoop filter directly accessing the TB to invalidate the entry.
14. The method of claim 13 wherein the snoop filter uses a standard write port to access the TB.
15. A snoop filter comprising a table comprising physical address entries corresponding to entries stored in a translation buffer (TB) implemented to store virtual to physical address translations, the table to mirror the operation of the first TB and implemented to search for entries upon receiving an invalidation request from a second CPU.
16. The snoop filter of claim 15 wherein a match found at the snoop filter during a search for entries indicates that an entry is to be invalidated at the snoop filter and the TB.
17. The snoop filter of claim 15 wherein the snoop filter is loaded with an entry each time the TB is loaded on a miss.
18. The snoop filter of claim 17 wherein the snoop filter and the TB implement a first in first out (FIFO) replacement policy to evict entries.
19. The snoop filter of claim 17 wherein the snoop filter and the TB implement a random replacement policy to evict entries.
20. The snoop filter of claim 19 wherein the same entries within snoop filter and the TB are replaced.
21. A computer system comprising: a first central processing unit (CPU); a second CPU having a translation buffer (TB) to store virtual to physical address translations; a main memory device coupled to the first CPU and the second CPU; and a snoop filter, coupled to the second CPU, to mirror the operation of the first TB and implemented to search for entries upon receiving an invalidation request from the first CPU.
22. The computer system of claim 21 wherein a match found at the snoop filter during a search for entries indicates that an entry is to be invalidated at the snoop filter and the TB.
23. The computer system of claim 22 wherein non-committed instructions at the first CPU are flushed prior to the entry being invalidated at the snoop filter and the TB.
24. The computer system of claim 21 wherein the snoop filter acknowledges invalidation requests received from the first CPU.
25. The computer system of claim 21 wherein the snoop filter comprises only physical addresses.